Prerequisites

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.6     ✓ dplyr   1.0.7
✓ tidyr   1.1.4     ✓ stringr 1.4.0
✓ readr   2.1.0     ✓ forcats 0.5.1
── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

First steps

Do cars with big engines use more fuel than cars with small engines?

The mpg data frame

mpg

Creating a ggplot

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

Exercises

1. Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

2. How many rows are in mpg? How many columns?

# rows
nrow(mpg)
[1] 234
# columns
ncol(mpg)
[1] 11

3. What does the drv variable describe? Read the help for ?mpg to find out.

glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi"…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quattro", "a4 quattro", "a4 quattr…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 3.1, 2.8, 3.1, 4.2, 5.3,…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008, 1999, 1999, 2008, 2008, 1999…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "manual(m5)", "auto(av)", "man…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "r",…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 17, 16, 14, 11, 14, 13, 12, 16…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24, 25, 23, 20, 15, 20, 17, 17, 26…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "r",…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "compact", "compact", "compact", "compa…

4. Make a scatterplot of hwy vs cyl.

ggplot(data = mpg, aes(x = cyl, y = hwy)) +
  geom_point()

5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(data = mpg, aes(x = class, y = drv)) +
  geom_point()

Aesthetic mappings

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy, size = class)) +
  geom_point()
Warning: Using size for a discrete variable is not advised.

# Left
ggplot(data = mpg, aes(x = displ, y = hwy, alpha = class)) +
  geom_point()
Warning: Using alpha for a discrete variable is not advised.

# Right
ggplot(data = mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()
Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to
discriminate; you have 7. Consider specifying shapes manually if you must have them.
Warning: Removed 62 rows containing missing values (geom_point).

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

Exercises

1. What’s wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

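Inside aes(), "blue" is treated as a data value, so ggplot2 maps it through the color scale instead of drawing blue points. A manual color setting needs to go outside of the aes argument.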

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

2. Which variables in mpg are categorical? Which variables are continuous? How can you see this information when you run mpg?

glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi"…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quattro", "a4 quattro", "a4 quattr…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 3.1, 2.8, 3.1, 4.2, 5.3,…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008, 1999, 1999, 2008, 2008, 1999…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "manual(m5)", "auto(av)", "man…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "r",…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 17, 16, 14, 11, 14, 13, 12, 16…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24, 25, 23, 20, 15, 20, 17, 17, 26…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "r",…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "compact", "compact", "compact", "compa…

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = year))


ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = year))


ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = year))
Error: A continuous variable can not be mapped to shape
Run `rlang::last_error()` to see where the error occurred.

4. What happens if you map the same variable to multiple aesthetics?

5. What does the stroke aesthetic do? What shapes does it work with?

6. What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? Note, you’ll also need to specify x and y.

Common problems

Facets

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

Exercises

1. What happens if you facet on a continuous variable?

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cty)

Each distinct value of the variable gets its own panel, so a continuous variable with many values produces many sparsely populated panels.

2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

Empty cells mean that no cars in the data have that combination of drv and cyl.

3. What plots does the following code make? What does the . do?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)


ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

The . is a placeholder for "no faceting variable on this dimension": drv ~ . facets by rows only, and . ~ cyl facets by columns only.

4. Take the first faceted plot in this section:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Geometric objects

Exercises

1. What geom would you use to draw a line chart?

geom_line()

A boxplot?

geom_boxplot()

A histogram?

geom_histogram()

An area chart?

geom_area()

2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

This code will produce a scatterplot colored by drv, with a separate smooth line for each drv value and no confidence bands.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

show.legend = FALSE suppresses the legend for that layer.

4. What does the se argument to geom_smooth() do?

It controls whether the confidence band around the smooth line is drawn.

5. Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

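No, they will produce the same plot: both versions give geom_point() and geom_smooth() the same data and mappings, whether declared once in ggplot() or repeated in each layer.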

6. Recreate the R code necessary to generate the following graphs.

(plot1 + plot2) / (plot3 + plot4) / (plot5 + plot6)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Statistical transformations

Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

The default geom is geom_pointrange(), and the default stat for that geom is "identity".

ggplot(data = diamonds) +
  geom_pointrange(aes(x = cut, y = depth), stat = "summary")
No summary function supplied, defaulting to `mean_se()`

3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list for all the pairs. What do they have in common?

4. What variables does stat_smooth() compute? What parameters control its behavior?

It computes a predicted value (y), a pointwise confidence interval (ymin, ymax), and a standard error (se).

5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))


ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

Position adjustments

---
title: "Data visualisation"
output: html_notebook
---

# Prerequisites

```{r}
library(tidyverse)
```

# First steps

Do cars with big engines use more fuel than cars with small engines?

## The `mpg` data frame

```{r}
mpg
```

## Creating a ggplot

```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```

## Exercises

**1.  Run `ggplot(data = mpg)`. What do you see?**

```{r}
ggplot(data = mpg)
```

**2.  How many rows are in `mpg`? How many columns?**

```{r}
# rows
nrow(mpg)

# columns
ncol(mpg)
```

**3.  What does the `drv` variable describe? Read the help for `?mpg` to find out.**

```{r}
glimpse(mpg)
```
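
According to `?mpg`, `drv` is the type of drive train: `f` for front-wheel drive, `r` for rear-wheel drive, and `4` for four-wheel drive. As a quick check, the sketch below tabulates the codes that appear in the data; the name `drv_counts` is only illustrative.

```{r}
library(tidyverse)

# Distinct drive-train codes and how many cars carry each one
drv_counts <- count(mpg, drv)
drv_counts
```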

**4. Make a scatterplot of `hwy` vs `cyl`.**

```{r}
ggplot(data = mpg, aes(x = cyl, y = hwy)) +
  geom_point()
```

**5. What happens if you make a scatterplot of `class` vs `drv`? Why is the plot not useful?**

```{r}
ggplot(data = mpg, aes(x = class, y = drv)) +
  geom_point()
```
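
One likely reason the plot is not useful: both `class` and `drv` are categorical, so many cars land on the same point and the plot hides how many observations sit behind each combination. The sketch below is one possible way to make the overlap visible, using `count()` and `geom_count()`; the name `class_drv_counts` is illustrative.

```{r}
library(tidyverse)

# Number of cars behind each class/drv combination
class_drv_counts <- count(mpg, class, drv)
class_drv_counts

# geom_count() sizes each point by the number of overlapping observations
ggplot(data = mpg, aes(x = class, y = drv)) +
  geom_count()
```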

# Aesthetic mappings

```{r}
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
```

```{r}
ggplot(data = mpg, aes(x = displ, y = hwy, size = class)) +
  geom_point()
```

```{r}
# Left
ggplot(data = mpg, aes(x = displ, y = hwy, alpha = class)) +
  geom_point()

# Right
ggplot(data = mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()
```

```{r}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
```

## Exercises

**1. What's wrong with this code? Why are the points not blue?**

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```

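Inside `aes()`, `"blue"` is treated as a data value, so ggplot2 maps it through the color scale instead of drawing blue points. A manual color setting needs to go outside of the `aes` argument.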

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```

**2. Which variables in `mpg` are categorical? Which variables are continuous? How can you see this information when you run `mpg`?**

```{r}
glimpse(mpg)
```
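
When `mpg` is printed, the abbreviation under each column name (`<chr>`, `<int>`, `<dbl>`) gives its type. As a rough sketch, the column types can also be listed programmatically. Treating character columns as categorical is an assumption here, since integer columns such as `cyl` or `year` are often used as categories too.

```{r}
library(tidyverse)

# Storage type of every column in mpg
col_types <- map_chr(mpg, typeof)
col_types

# Character columns, which are the clearly categorical ones
names(col_types)[col_types == "character"]
```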

**3. Map a continuous variable to `color`, `size`, and `shape`. How do these aesthetics behave differently for categorical vs. continuous variables?**

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = year))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = year))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = year))
```
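
With a continuous variable, `color` uses a gradient and `size` uses ordered point sizes, while `shape` fails because shapes have no natural order. One possible workaround, sketched below, is to convert the variable to a factor first. This is only sensible when it has few distinct values, as `year` does here.

```{r}
library(tidyverse)

# factor() turns year into a discrete variable that shape can handle
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = factor(year)))
```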

**4. What happens if you map the same variable to multiple aesthetics?**

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = hwy, size = hwy))
```

**5. What does the `stroke` aesthetic do? What shapes does it work with?**

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), shape = 21, stroke = 2)
```

**6. What happens if you map an aesthetic to something other than a variable name, like `aes(color = displ < 5)`? Note, you'll also need to specify x and y.**

```{r}
ggplot(data = mpg, 
       aes(x = displ, y = hwy, color = displ < 5)) + 
  geom_point()
```

# Common problems

# Facets

```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
```

```{r}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
```

## Exercises

**1. What happens if you facet on a continuous variable?**


```{r}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cty)
```

Each distinct value of the variable gets its own panel, so a continuous variable with many values produces many sparsely populated panels.
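
To see how many panels to expect, one can count the distinct values of the faceting variable. The sketch below also shows one possible alternative: binning the variable first with `cut_interval()`. The column name `cty_bin` and the choice of three bins are illustrative assumptions.

```{r}
library(tidyverse)

# One panel is drawn per distinct value of the faceting variable
n_distinct(mpg$cty)

# Binning into 3 equal-width intervals keeps the panel count small
mpg_binned <- mutate(mpg, cty_bin = cut_interval(cty, n = 3))

ggplot(data = mpg_binned, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cty_bin)
```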

**2. What do the empty cells in plot with `facet_grid(drv ~ cyl)` mean? How do they relate to this plot?**

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))
```


```{r}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
```

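Empty cells mean that no cars in the data have that combination of `drv` and `cyl`. They correspond to the positions with no point in the scatterplot of `cyl` against `drv` above.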
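
The empty combinations can also be listed directly. This is a minimal sketch using `count()` and `tidyr::complete()`; the names `observed` and `all_combos` are illustrative.

```{r}
library(tidyverse)

# Combinations of drv and cyl that actually occur in the data
observed <- count(mpg, drv, cyl)

# Expand to every possible combination, filling missing counts with 0
all_combos <- complete(observed, drv, cyl, fill = list(n = 0L))

# Rows with n == 0 correspond to the empty facets
filter(all_combos, n == 0)
```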

**3. What plots does the following code make? What does the `.` do?**

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```

The `.` is a placeholder for "no faceting variable on this dimension": `drv ~ .` facets by rows only, and `. ~ cyl` facets by columns only.

**4. Take the first faceted plot in this section:**

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
```

**What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?**

**5. Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? What other options control the layout of the individual panels? Why doesn't `facet_grid()` have `nrow` and `ncol` arguments?**

**6. When using `facet_grid()` you should usually put the variable with more unique levels in the columns. Why?**

# Geometric objects

## Exercises

**1. What geom would you use to draw a line chart?**

`geom_line()`

**A boxplot?**

`geom_boxplot()`

**A histogram?**

`geom_histogram()`

**An area chart?**

`geom_area()`

**2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.**

This code will produce a scatterplot colored by `drv`, with a separate smooth line for each `drv` value and no confidence bands.

```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
```

**3. What does `show.legend = FALSE` do? What happens if you remove it? Why do you think I used it earlier in the chapter?**

`show.legend = FALSE` suppresses the legend for that layer.

**4. What does the `se` argument to `geom_smooth()` do?**

It controls whether the confidence band around the smooth line is drawn.

**5. Will these two graphs look different? Why/why not?**

```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```

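No, they will produce the same plot: both versions give `geom_point()` and `geom_smooth()` the same data and mappings, whether declared once in `ggplot()` or repeated in each layer.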

**6. Recreate the R code necessary to generate the following graphs.**

```{r}
plot1 <- 
  ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 4) +
  geom_smooth(se = FALSE)

plot2 <- 
  ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 4) +
  geom_smooth(aes(group = drv), se = FALSE)

plot3 <- 
  ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(size = 4) +
  geom_smooth(aes(group = drv), se = FALSE)

plot4 <- 
  ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 4, aes(color = drv)) +
  geom_smooth(se = FALSE)

plot5 <- 
  ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(size = 4) +
  geom_smooth(aes(linetype = drv), se = FALSE)

plot6 <- 
  ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(size = 4, color = "white") +
  geom_point()

library(patchwork)

(plot1 + plot2) / (plot3 + plot4) / (plot5 + plot6)
```

# Statistical transformations

## Exercises

**1. What is the default geom associated with `stat_summary()`? How could you rewrite the previous plot to use that geom function instead of the stat function?**

The default geom is `geom_pointrange()`, and the default stat for that geom is `"identity"`.

```{r}
ggplot(data = diamonds) +
  geom_pointrange(aes(x = cut, y = depth), stat = "summary")
```

**3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list for all the pairs. What do they have in common?**

**4. What variables does `stat_smooth()` compute? What parameters control its behavior?**

It computes a predicted value (`y`), a pointwise confidence interval (`ymin`, `ymax`), and a standard error (`se`).
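
The computed variables can be inspected with `layer_data()`, which returns the data frame a layer is drawn from. The sketch below sets `method` and `formula` explicitly as an illustrative choice. Other parameters that affect the fit include `span`, `level`, `se`, and `n`.

```{r}
library(tidyverse)

p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x)

# Data behind the first (and only) layer, including the variables
# computed by stat_smooth()
smooth_data <- layer_data(p, 1)
names(smooth_data)
```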

**5. In our proportion bar chart, we need to set `group = 1`. Why? In other words what is the problem with these two graphs?**

```{r}
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
```
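
Without `group = 1`, `prop` is computed within each bar's own group, so every bar ends up with a proportion of 1. A minimal sketch of the usual fix for the first plot, with `p_prop` as an illustrative name:

```{r}
library(tidyverse)

# group = 1 puts all cuts in one group, so prop is each cut's share of the total
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p_prop

# The proportions computed by the stat should add up to 1 across the bars
sum(layer_data(p_prop)$prop)
```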

# Position adjustments

